Mobi-$π$: Mobilizing Your Robot Learning Policy

Yang, Jingyun, Huang, Isabella, Vu, Brandon, Bajracharya, Max, Antonova, Rika, Bohg, Jeannette

arXiv.org Artificial Intelligence

Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the policy mobilization problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. Crucially, this problem formulation complements existing efforts to improve manipulation policy robustness to novel viewpoints and remains compatible with them. We propose a novel approach for policy mobilization that bridges navigation and manipulation by optimizing the robot's base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes 3D Gaussian Splatting for novel view synthesis, a score function to evaluate pose suitability, and sampling-based optimization to identify optimal robot poses. To understand policy mobilization in more depth, we also introduce the Mobi-$π$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, and (3) visualization tools for analysis. In both our developed simulation task suite and the real world, we show that our approach outperforms baselines, demonstrating its effectiveness for policy mobilization.
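The pose-search loop described above can be sketched as follows. This is a toy stand-in under invented assumptions: the training poses, search bounds, and nearest-neighbor score are made up, and the real method scores novel views rendered with 3D Gaussian Splatting rather than comparing poses directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical base poses (x, y, yaw) the manipulation policy saw in training.
train_poses = np.array([[0.0, 0.0, 0.0],
                        [0.1, 0.05, 0.1],
                        [-0.05, 0.1, -0.1]])

def score(pose):
    # Toy stand-in for the learned suitability score: closeness to the
    # nearest training pose (the paper instead scores rendered views).
    return -np.linalg.norm(train_poses - pose, axis=1).min()

def mobilize(n_samples=512):
    # Sampling-based optimization: draw candidate base poses in the
    # navigable region and keep the most in-distribution one.
    candidates = rng.uniform([-1.0, -1.0, -np.pi], [1.0, 1.0, np.pi],
                             size=(n_samples, 3))
    scores = np.array([score(c) for c in candidates])
    return candidates[scores.argmax()]

best_pose = mobilize()
```

The chosen base pose can then be handed to a navigation stack, decoupling where the robot stands from how it manipulates.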


ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation

Chen, Jason, Liu, I-Chun Arthur, Sukhatme, Gaurav, Seita, Daniel

arXiv.org Artificial Intelligence

Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.


Calib3R: A 3D Foundation Model for Multi-Camera to Robot Calibration and 3D Metric-Scaled Scene Reconstruction

Allegro, Davide, Terreran, Matteo, Ghidoni, Stefano

arXiv.org Artificial Intelligence

RELATED WORKS Hand-Eye Calibration: Hand-eye calibration is a well-established problem in robotics that aims to estimate the relative pose between a camera and a robot's end-effector. It is typically addressed by capturing a series of images of a known calibration pattern (e.g., a checkerboard) using a camera rigidly mounted on the robot hand, and using both the images and the corresponding robot poses to compute the camera's extrinsic parameters. Different mathematical formulations exist for solving hand-eye calibration; a widely adopted approach involves solving the equation AX = XB, where X is the unknown rigid transformation describing the pose of the camera with respect to the robot, while A and B denote the relative motions of the end-effector (from robot kinematics) and the camera (from pattern observations), respectively [31], [36]-[38]. Several other approaches have been proposed: Shah [39] formulated a closed-form solution for the hand-eye problem by using an algorithm based on Singular Value Decomposition (SVD) and the Kronecker product to solve for rotation and translation separately, while Li et al. [40] used dual quaternions to solve them simultaneously, overcoming the limitations of the Kronecker product. Wang et al. [23] extended hand-eye calibration to multi-camera setups by incorporating a common reference frame, but required an external motion capture system, limiting its applicability to small setups. Andreff and Heller [41], [42] proposed two similar hand-eye calibration methods that leverage the Structure-from-Motion (SfM) paradigm to estimate camera motion and introduced a formulation for hand-eye calibration that includes a factor to metrically scale camera poses.
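The AX = XB formulation above can be sketched on synthetic motions. This is an illustrative implementation of the Kronecker-product/SVD style of solution the text attributes to Shah [39] (rotation from the nullspace of a stacked linear system, then translation from a linear least squares); the two motions and all numeric values are invented for the demo.

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def solve_ax_xb(A_list, B_list):
    # Rotation: R_A R_X = R_X R_B  =>  (I kron R_A - R_B^T kron I) vec(R_X) = 0,
    # stacked over all motion pairs; take the nullspace vector via SVD.
    RA = [A[:3, :3] for A in A_list]; tA = [A[:3, 3] for A in A_list]
    RB = [B[:3, :3] for B in B_list]; tB = [B[:3, 3] for B in B_list]
    M = np.vstack([np.kron(np.eye(3), Ra) - np.kron(Rb.T, np.eye(3))
                   for Ra, Rb in zip(RA, RB)])
    RX = np.linalg.svd(M)[2][-1].reshape(3, 3, order="F")
    if np.linalg.det(RX) < 0:
        RX = -RX
    U, _, Vt = np.linalg.svd(RX)   # project the scaled estimate onto SO(3)
    RX = U @ Vt
    # Translation: (R_A - I) t_X = R_X t_B - t_A, stacked over all pairs.
    L = np.vstack([Ra - np.eye(3) for Ra in RA])
    r = np.concatenate([RX @ tb - ta for tb, ta in zip(tB, tA)])
    tX = np.linalg.lstsq(L, r, rcond=None)[0]
    X = np.eye(4); X[:3, :3] = RX; X[:3, 3] = tX
    return X

# Synthetic ground truth and two motions with non-parallel rotation axes.
X_true = np.eye(4)
X_true[:3, :3] = rot_z(0.3) @ rot_x(0.2)
X_true[:3, 3] = [0.1, -0.2, 0.05]
B1 = np.eye(4); B1[:3, :3] = rot_z(0.5); B1[:3, 3] = [0.2, 0.0, 0.1]
B2 = np.eye(4); B2[:3, :3] = rot_x(0.7); B2[:3, 3] = [0.0, 0.3, -0.1]
A1 = X_true @ B1 @ np.linalg.inv(X_true)
A2 = X_true @ B2 @ np.linalg.inv(X_true)
X_est = solve_ax_xb([A1, A2], [B1, B2])
```

At least two motions with non-parallel rotation axes are needed, otherwise the nullspace is not one-dimensional and X is not unique.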


Tree-SLAM: semantic object SLAM for efficient mapping of individual trees in orchards

Rapado-Rincon, David, Kootstra, Gert

arXiv.org Artificial Intelligence

Accurate mapping of individual trees is an important component for precision agriculture in orchards, as it allows autonomous robots to perform tasks like targeted operations or individual tree monitoring. However, creating these maps is challenging because GPS signals are often unreliable under dense tree canopies. Furthermore, standard Simultaneous Localization and Mapping (SLAM) approaches struggle in orchards because the repetitive appearance of trees can confuse the system, leading to mapping errors. To address this, we introduce Tree-SLAM, a semantic SLAM approach tailored for creating maps of individual trees in orchards. Utilizing RGB-D images, our method detects tree trunks with an instance segmentation model, estimates their location and re-identifies them using a cascade-graph-based data association algorithm. These re-identified trunks serve as landmarks in a factor graph framework that integrates noisy GPS signals, odometry, and trunk observations. The system produces maps of individual trees with a geo-localization error as low as 18 cm, which is less than 20% of the planting distance. The proposed method was validated on diverse datasets from apple and pear orchards across different seasons, demonstrating high mapping accuracy and robustness in scenarios with unreliable GPS signals.

Keywords: semantic SLAM, agricultural robotics, multi-object tracking, factor graph

1. Introduction

A significant decline in available agricultural labor presents a challenge for sustaining agricultural production, potentially leading to food losses [1, 2]. Automation and robotics are emerging as key technologies to address these issues, offering the potential to enhance productivity by compensating for labor scarcity and optimizing farm management through data-driven insights [3, 4]. This is particularly relevant in high-value crops such as those found in orchards, where precise operations have the potential to improve efficiency and reduce labor needs. For autonomous robots to perform tasks effectively in orchards, such as targeted spraying or individual tree monitoring, they require a detailed map of the environment and the ability to determine their position within it.
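The factor graph described above, fusing GPS priors, odometry, and trunk observations, can be sketched as a small joint least-squares problem. Everything here is a toy: four 2-D positions, one trunk landmark, Gaussian noise, and hand-picked sigmas, with no orientation and no data association.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

# Hypothetical ground truth: four robot positions along a row and one trunk.
true_poses = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.]])
true_trunk = np.array([1.5, 2.0])

# Simulated measurements for the three factor types in the text.
odom = np.diff(true_poses, axis=0) + rng.normal(0, 0.02, (3, 2))  # odometry
gps = true_poses + rng.normal(0, 0.5, (4, 2))                     # noisy GPS
obs = (true_trunk - true_poses) + rng.normal(0, 0.05, (4, 2))     # trunk obs.

def residuals(x):
    poses, trunk = x[:8].reshape(4, 2), x[8:]
    return np.concatenate([
        ((np.diff(poses, axis=0) - odom) / 0.02).ravel(),  # odometry factors
        ((poses - gps) / 0.5).ravel(),                     # GPS prior factors
        ((trunk - poses - obs) / 0.05).ravel(),            # landmark factors
    ])

x0 = np.concatenate([gps.ravel(), gps[0] + obs[0]])  # initialize from raw data
sol = least_squares(residuals, x0)
est_poses = sol.x[:8].reshape(4, 2)
```

Because the precise odometry and trunk factors pin down the relative geometry, the noisy GPS only has to anchor the global offset, which is the mechanism the abstract relies on.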


Understanding and Mitigating Network Latency Effect on Teleoperated-Robot with Extended Reality

Zhang, Ziliang, Liu, Cong, Kim, Hyoseung

arXiv.org Artificial Intelligence

Robot teleoperation with extended reality (XR teleoperation) enables intuitive interaction by allowing remote robots to mimic user motions with real-time 3D feedback. However, existing systems face significant motion-to-motion (M2M) latency--the delay between the user's latest motion and the corresponding robot feedback--leading to high teleoperation error and mission completion time. This issue stems from the system's exclusive reliance on network communication, making it highly vulnerable to network degradation. To address these challenges, we introduce TeleXR, the first end-to-end, fully open-sourced XR teleoperation framework that decouples robot control and XR visualization from network dependencies. TeleXR leverages local sensing data to reconstruct delayed or missing information of the counterpart, thereby significantly reducing network-induced issues. This approach allows both the XR and robot to run concurrently with network transmission while maintaining high robot planning accuracy. TeleXR also features contention-aware scheduling to mitigate GPU contention and bandwidth-adaptive point cloud scaling to cope with limited bandwidth.
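The bandwidth-adaptive point cloud scaling mentioned above can be sketched as follows. The budget formula, frame rate, and 12 bytes per point (three float32 coordinates) are illustrative assumptions, not TeleXR's actual parameters.

```python
import numpy as np

def scale_point_cloud(points, bandwidth_bps, frame_rate=30, bytes_per_point=12):
    # Keep only as many points as the current link budget allows per frame.
    budget = int(bandwidth_bps / 8 / frame_rate / bytes_per_point)
    if len(points) <= budget:
        return points
    # Uniform random subsampling; a real system might prefer voxel filtering.
    keep = np.random.default_rng(0).choice(len(points), size=budget,
                                           replace=False)
    return points[keep]

cloud = np.random.default_rng(1).uniform(-1, 1, (100_000, 3))
reduced = scale_point_cloud(cloud, bandwidth_bps=2_000_000)  # a 2 Mbit/s link
```

As the measured bandwidth changes, the same function re-scales each outgoing frame, so the stream degrades in density rather than in latency.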


DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets

Makarova, Maria, Liu, Qian, Tsetserukou, Dzmitry

arXiv.org Artificial Intelligence

Diffusion models have proven to be a powerful tool in the field of generative artificial intelligence, successfully applied in image synthesis, video generation, and audio generation [1, 2, 3, 4, 5]. Using an iterative denoising approach, these models learn to invert a diffusion process, transforming random noise into sophisticated, high-quality samples. Reinforcement Learning (RL) and Imitation Learning (IL) have become particularly popular in robot learning in recent years for the tasks of perceiving the environment and making decisions to perform actions [6]. However, RL is highly dependent on the correct tuning of hyper-parameters [7], and effective IL training requires a large amount of diverse, high-quality data [8]. Moreover, the multimodal nature of complex robot tasks hinders the construction of stable control. More recently, researchers have begun to apply diffusion policy learning to robotics as well. The concept of diffusion policy was first introduced by Chi et al. [9]. The diffusion process has been applied to robot action sequence generation because such models are able to capture the complex multimodal distributions that are characteristic of many robotics tasks, as mentioned above.
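The iterative denoising at the heart of diffusion policies can be sketched in a few lines. This toy replaces the trained noise-prediction network with an oracle that knows the clean action, so the DDPM-style reverse process collapses onto that action; the schedule and the 2-D "action" are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)   # noise schedule (illustrative values)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

target = np.array([0.3, -0.7])      # a made-up 2-D robot action

def eps_model(x, t):
    # Stand-in for a trained noise-prediction network: here it "knows" the
    # clean action exactly, so sampling should recover `target`.
    return (x - np.sqrt(abar[t]) * target) / np.sqrt(1.0 - abar[t])

x = rng.normal(size=2)              # start from pure noise
for t in range(T - 1, -1, -1):
    eps = eps_model(x, t)
    # DDPM reverse update: subtract the predicted noise, then rescale.
    x = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        x += np.sqrt(betas[t]) * rng.normal(size=2)  # stochastic step
```

In a real diffusion policy, `eps_model` is a network conditioned on observations, and its multimodal predictions are what let one policy represent several valid action sequences.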


Occupancy-SLAM: An Efficient and Robust Algorithm for Simultaneously Optimizing Robot Poses and Occupancy Map

Wang, Yingyu, Zhao, Liang, Huang, Shoudong

arXiv.org Artificial Intelligence

Joint optimization of poses and features has been extensively studied and demonstrated to yield more accurate results in feature-based SLAM problems. However, research on jointly optimizing poses and non-feature-based maps remains limited. Occupancy maps are widely used non-feature-based environment representations because they effectively classify spaces into obstacles, free areas, and unknown regions, providing robots with spatial information for various tasks. In this paper, we propose Occupancy-SLAM, a novel optimization-based SLAM method that enables the joint optimization of robot trajectory and the occupancy map through a parameterized map representation. The key novelty lies in optimizing both robot poses and occupancy values at different cell vertices simultaneously, a significant departure from existing methods where the robot poses need to be optimized first before the map can be estimated. Evaluations using simulations and practical 2D laser datasets demonstrate that the proposed approach can robustly obtain more accurate robot trajectories and occupancy maps than state-of-the-art techniques with comparable computational time. Preliminary results in the 3D case further confirm the potential of the proposed method in practical 3D applications, achieving more accurate results than existing methods.
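The key coupling the abstract describes, residuals that depend on both the robot pose and the occupancy values at cell vertices, can be sketched with a bilinear map lookup. The grid, wall, and scan below are invented; a real system would feed these residuals (for every scan) to a nonlinear least-squares solver over poses and cell values jointly.

```python
import numpy as np

def bilinear(grid, x, y):
    # Continuous occupancy lookup from the values at the four surrounding
    # cell vertices; this is what makes the map a smooth function of position.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * grid[y0, x0] + dx * (1 - dy) * grid[y0, x0 + 1]
            + (1 - dx) * dy * grid[y0 + 1, x0] + dx * dy * grid[y0 + 1, x0 + 1])

def scan_residuals(pose, scan, grid, occupied=1.0):
    # Each scan endpoint, placed into the map by the robot pose, should land
    # on high occupancy; the residual depends on BOTH the pose and the cell
    # values, which is what allows optimizing them simultaneously.
    x, y, th = pose
    c, s = np.cos(th), np.sin(th)
    pts = scan @ np.array([[c, s], [-s, c]]) + np.array([x, y])
    return np.array([bilinear(grid, px, py) - occupied for px, py in pts])

grid = np.zeros((5, 5))
grid[:, 3] = 1.0                            # a wall at map column x = 3
scan = np.array([[3.0, 1.0], [3.0, 2.0]])   # endpoints observed on the wall
```

At the correct pose the residuals vanish; shifting the pose (or perturbing the cell values) makes them nonzero, giving the solver gradients in both sets of variables.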


Grid-based Submap Joining: An Efficient Algorithm for Simultaneously Optimizing Global Occupancy Map and Local Submap Frames

Wang, Yingyu, Zhao, Liang, Huang, Shoudong

arXiv.org Artificial Intelligence

Optimizing robot poses and the map simultaneously has been shown to provide more accurate SLAM results. However, for non-feature based SLAM approaches, directly optimizing all the robot poses and the whole map will greatly increase the computational cost, making SLAM problems difficult to solve in large-scale environments. To solve the 2D non-feature based SLAM problem in large-scale environments more accurately and efficiently, we propose the grid-based submap joining method. Specifically, we first formulate the 2D grid-based submap joining problem as a non-linear least squares (NLLS) form to optimize the global occupancy map and local submap frames simultaneously. We then prove that in solving the NLLS problem using the Gauss-Newton (GN) method, the increments of the poses in each iteration are independent of the occupancy values of the global occupancy map. Based on this property, we propose a pose-only GN algorithm equivalent to the full GN method to solve the NLLS problem. The proposed submap joining algorithm is very efficient due to the independence property and the pose-only solution. Evaluations using simulations and publicly available practical 2D laser datasets confirm that our proposed method outperforms state-of-the-art methods in terms of efficiency and accuracy, and can solve the grid-based SLAM problem in very large-scale environments.
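The structural idea behind a pose-only solve can be illustrated with a separable least-squares toy: when residuals are linear in the map values, the map can be eliminated in closed form and only the pose variables need iterating. This sketch is a generic variable-projection demo under invented functions and values, not the paper's equivalence proof.

```python
import numpy as np
from scipy.optimize import least_squares

# Residuals are nonlinear in the "pose" p but LINEAR in the "map" values m.
t = np.linspace(0.0, 3.0, 30)
true_p, true_m = 2.0, np.array([1.5, 0.3])

def design(p):
    # Map values enter linearly through this pose-dependent design matrix.
    return np.column_stack([np.sin(p * t), np.ones_like(t)])

b = design(true_p) @ true_m                        # synthetic observations

def reduced_residual(p):
    A = design(p[0])
    m_star = np.linalg.lstsq(A, b, rcond=None)[0]  # map solved in closed form
    return A @ m_star - b                          # residual depends on p only

sol = least_squares(reduced_residual, x0=[1.5])    # iterate over the pose only
```

The reduced problem has far fewer variables than the joint one, which is where the efficiency of a pose-only iteration comes from.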


Differentiable Robot Rendering

Liu, Ruoshi, Canberk, Alper, Song, Shuran, Vondrick, Carl

arXiv.org Artificial Intelligence

Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model and Gaussian Splatting and is compatible with any robot form factor and degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for future applications of vision foundation models in robotics.
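The core idea, gradients of a pixel loss with respect to control parameters, can be sketched with a toy planar 2-link arm whose joints render as Gaussian blobs. Everything here is invented (link lengths, image size, blob width), and finite differences stand in for the analytic gradients of the actual method.

```python
import numpy as np

H = W = 32
ys, xs = np.mgrid[0:H, 0:W].astype(float)

def render(q, l1=8.0, l2=6.0):
    # Toy "robot renderer": the two joint positions of a planar 2-link arm
    # become Gaussian blobs on the image (a crude nod to Gaussian Splatting).
    x1, y1 = 16 + l1 * np.cos(q[0]), 16 + l1 * np.sin(q[0])
    x2, y2 = x1 + l2 * np.cos(q[0] + q[1]), y1 + l2 * np.sin(q[0] + q[1])
    img = np.zeros((H, W))
    for cx, cy in [(x1, y1), (x2, y2)]:
        img += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / 8.0)
    return img

target = render(np.array([0.6, -0.4]))   # image of the desired configuration

def pixel_loss(q):
    return np.sum((render(q) - target) ** 2)

def grad(q, h=1e-4):
    # Finite differences stand in for analytic pixel-to-parameter gradients.
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (pixel_loss(q + e) - pixel_loss(q - e)) / (2 * h)
    return g

q = np.array([0.2, 0.1])
g = grad(q)
```

Stepping the joint angles against this gradient reduces the image discrepancy, which is exactly the "control directly from pixels" signal the abstract refers to.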


Combining Planning and Diffusion for Mobility with Unknown Dynamics

Ravan, Yajvan, Yang, Zhutian, Chen, Tao, Lozano-Pérez, Tomás, Kaelbling, Leslie Pack

arXiv.org Artificial Intelligence

Manipulation of large objects over long horizons (such as carts in a warehouse) is an essential skill for deployable robotic systems. Large objects require mobile manipulation which involves simultaneous manipulation, navigation, and movement with the object in tow. In many real-world situations, object dynamics are incredibly complex, such as the interaction of an office chair (with a rotating base and five caster wheels) and the ground. We present a hierarchical algorithm for long-horizon robot manipulation problems in which the dynamics are partially unknown. We observe that diffusion-based behavior cloning is highly effective for short-horizon problems with unknown dynamics, so we decompose the problem into an abstract high-level, obstacle-aware motion-planning problem that produces a waypoint sequence. We use a short-horizon, relative-motion diffusion policy to achieve the waypoints in sequence. We train mobile manipulation policies on a Spot robot that has to push and pull an office chair. Our hierarchical manipulation policy performs consistently better, especially as the horizon increases, than a diffusion policy trained on long-horizon demonstrations or motion planning that assumes a rigidly attached object (succeeding in 8 of 10 runs, versus 0 and 5, respectively). Importantly, our learned policy generalizes to new layouts, grasps, chairs, and flooring that induces more friction, without any further training, showing promise for other complex mobile manipulation problems. Project Page: https://yravan.github.io/plannerorderedpolicy/
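The hierarchy described above can be sketched as a planner that emits waypoints and a short-horizon controller that chases them. Both pieces here are toy stand-ins: the "planner" is a straight line with no obstacles, and a proportional step replaces the relative-motion diffusion policy.

```python
import numpy as np

def plan_waypoints(start, goal, spacing=0.5):
    # High-level planner sketch: evenly spaced waypoints on the straight
    # line to the goal (the paper's planner is obstacle-aware; this is not).
    n = max(int(np.ceil(np.linalg.norm(goal - start) / spacing)), 1)
    return [start + (goal - start) * (i + 1) / n for i in range(n)]

def short_horizon_policy(pos, waypoint, gain=0.5):
    # Stand-in for the relative-motion diffusion policy: command a fraction
    # of the remaining offset to the current waypoint.
    return gain * (waypoint - pos)

pos, goal = np.array([0.0, 0.0]), np.array([3.0, 4.0])
for wp in plan_waypoints(pos, goal):
    for _ in range(20):              # a few closed-loop steps per waypoint
        pos = pos + short_horizon_policy(pos, wp)
        if np.linalg.norm(pos - wp) < 0.05:
            break
```

Because the low-level policy only ever has to reach the next nearby waypoint, its unknown-dynamics burden stays short-horizon no matter how long the overall task is.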